Abstract

The Nystrom method is an efficient technique 
to speed up large-scale learning applications 
by generating low-rank approximations. Cru- 
cial to the performance of this technique is 
the assumption that a matrix can be well 
approximated by working exclusively with a 
subset of its columns. In this work we re- 
late this assumption to the concept of ma- 
trix coherence and connect matrix coherence 
to the performance of the Nystrom method. 
Making use of related work in the compressed 
sensing and the matrix completion literature, 
we derive novel coherence-based bounds for 
the Nystrom method in the low-rank set- 
ting. We then present empirical results that 
corroborate these theoretical bounds. Fi- 
nally, we present more general empirical re- 
sults for the full-rank setting that convinc- 
ingly demonstrate the ability of matrix co- 
herence to measure the degree to which in- 
formation can be extracted from a subset of 
columns. 



1 Introduction 

Modern problems in computer vision, natural language processing,
computational biology and other areas often involve datasets containing
millions of training instances. However, several standard methods in
machine learning, such as spectral clustering (Ng et al., 2001), manifold
learning techniques (de Silva and Tenenbaum, 2003; Scholkopf et al., 1998),
kernel ridge regression (Saunders et al., 1998) or other kernel-based
algorithms do not scale to such orders of magnitude. In fact, even storage
of the matrices associated with these datasets can be problematic since
they are often not sparse and hence the number of entries is extremely
large. As shown by Williams and Seeger (2000), the Nystrom method provides
an attractive solution when working with large-scale datasets by operating
on only a small part of the original matrix to generate a low-rank
approximation. The Nystrom method has been shown to work well in practice
for various applications ranging from manifold learning to image
segmentation (Fowlkes et al., 2004; Platt, 2004; Talwalkar et al., 2008;
Zhang et al., 2008).



The effectiveness of the Nystrom method hinges on two key assumptions on
the input matrix, G. First, we assume that a low-rank approximation to G
can be effective for the task at hand. This assumption is often true
empirically, as evidenced by the widespread use of singular value
decomposition (SVD) and principal component analysis (PCA) in practical
applications. As expected, the Nystrom method is not appropriate in cases
where this assumption does not hold, which explains its poor performance
in the experimental results of Fergus et al. (2009). Previous work
analyzing the performance of the Nystrom method incorporates this low-rank
assumption into theoretical guarantees by comparing the Nystrom
approximation to the 'best' low-rank approximation, i.e., the
approximation constructed from the top singular values and singular
vectors of G (see Section 2 for further discussion) (Drineas and Mahoney,
2005; Kumar et al., 2009a).



The second crucial assumption of the Nystrom method involves the
sampling-based nature of the algorithm, namely that an accurate low-rank
approximation can be generated exclusively from information extracted from
a small subset of $l \ll n$ columns of G. This assumption is not generally
true for all matrices. For instance, consider the extreme case of the
$n \times n$ matrix described below:

$$G = \begin{bmatrix} e_1 & \cdots & e_r & \mathbf{0} & \cdots & \mathbf{0} \end{bmatrix}, \tag{1}$$

where $e_i$ is the $i$th column of the $n$ dimensional identity matrix and
$\mathbf{0}$ is the $n$ dimensional zero vector. Although this matrix has
rank $r$, it nonetheless cannot be well approximated by a random subset of
$l$ columns unless this subset includes $e_1, \ldots, e_r$. In order to
account for such pathological cases, previous theoretical bounds relied on
sampling columns of G from a non-uniform distribution weighted precisely
by the magnitude of the diagonal elements of G (Belabbas and Wolfe, 2009;
Drineas and Mahoney, 2005). Indeed, these bounds give better guarantees
for pathological cases. However, in practice, when working with real-world
datasets, uniform sampling is more commonly used, e.g., Fowlkes et al.
(2004); Platt (2004); Talwalkar et al. (2008); Williams and Seeger (2000),
since diagonal sampling is more expensive and does not typically
outperform uniform sampling (Kumar et al., 2009c). Hence the diagonal
sampling bounds are not applicable in this setting. Furthermore, these
bounds are typically loose for matrices in which the diagonal entries of
the matrix are roughly of the same magnitude, as in the case of all kernel
matrices generated from RBF kernels, for which the Nystrom method has been
noted to work particularly well (Williams and Seeger, 2000).



In this work, we propose to characterize the ability to extract
information from a small subset of $l$ columns using the notion of matrix
coherence, an alternative data-dependent measurement which we believe to
be intrinsically related to the algorithm's performance. Coherence
measures the extent to which the singular vectors of a matrix are
correlated with the standard basis. Intuitively, if we work with
sufficiently incoherent matrices, then we avoid pathological cases such as
the one presented in (1). Recent work on compressed sensing and matrix
completion, which also involve sampling-based approximations, has relied
heavily on coherence assumptions (Candes and Romberg, 2007; Candes et al.,
2006; Donoho, 2009).



The main contribution of this work is the connec- 
tion that is made between matrix coherence and the 
Nystrom method. Making use of related work in the 
compressed sensing and the matrix completion liter- 
ature, we give a more refined analysis of this algo- 
rithm as a function of matrix coherence, presenting a 
novel preliminary theoretical bound for the Nystrom 
method. We also present extensive empirical results 
that strongly relate coherence to the performance of 
the Nystrom method. 

The remainder of the paper is organized as follows. Section 2 introduces
basic definitions of coherence and gives a brief presentation of the
Nystrom method. In Section 3 we present our novel bound for the Nystrom
method under low-rank, low-coherence assumptions. Section 4 presents
extensive empirical studies that support our bound and illustrate a
similar connection between matrix coherence and the performance of the
Nystrom method for full-rank matrices. Our empirical results also show
that incoherence assumptions are valid for several datasets derived from
real-world applications.

2 Preliminaries 

Let $G \in \mathbb{R}^{n \times n}$ be a symmetric positive semidefinite
(SPSD) matrix. SPSD matrices, such as Gram or kernel matrices, often
appear in the context of machine learning. For any Gram matrix, there
exists an $N$ and $X \in \mathbb{R}^{N \times n}$ such that
$G = X^\top X$. We define $X^{(j)}$, $j = 1 \ldots n$, as the $j$th column
vector of $X$ and $X_{(i)}$, $i = 1 \ldots N$, as the $i$th row vector of
$X$, and denote by $\|\cdot\|$ the $\ell_2$ norm of a vector. Using
singular value decomposition (SVD), the Gram matrix can be written as
$G = V \Sigma V^\top$, where $V$ is orthonormal and
$\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$ is a real diagonal
matrix with diagonal entries sorted in decreasing order. For
$r = \mathrm{rank}(G)$, the pseudo-inverse of $G$ is defined as
$G^{+} = \sum_{t=1}^{r} \sigma_t^{-1} V^{(t)} V^{(t)\top}$. Further, for
$k < r$, $G_k = \sum_{t=1}^{k} \sigma_t V^{(t)} V^{(t)\top}$ is the 'best'
rank-$k$ approximation to $G$, or the rank-$k$ matrix with minimal
$\|\cdot\|_F$ distance to $G$, where $\|\cdot\|_F$ denotes the Frobenius
norm of a matrix.

2.1 Nystrom method 

The Nystrom method was presented in 
Williams and Seeger (2000) to speed up the per- 
formance of kernel machines. This is done by 
generating low-rank approximations of G using a 
subset of the columns of the matrix. Suppose we randomly sample $l \ll n$
columns of $G$ uniformly with replacement, and let $C$ be the $n \times l$
matrix of these sampled columns. Then, without loss of generality, we can
rearrange the columns and rows of $G$ based on this sampling and define
$X = [X_1 \; X_2]$, where $X_1 \in \mathbb{R}^{N \times l}$, such that

$$G = X^\top X = \begin{bmatrix} W & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{bmatrix} \tag{2}$$

and

$$C = \begin{bmatrix} W \\ X_2^\top X_1 \end{bmatrix}, \tag{3}$$

where $W = X_1^\top X_1$. The Nystrom approximation is now defined as:

$$G \approx \widetilde{G} = C W^{+} C^\top. \tag{4}$$

The Frobenius distance between $G$ and $\widetilde{G}$,
$\|G - \widetilde{G}\|_F$, is one standard measurement of the accuracy of
the Nystrom method. The runtime of this algorithm is $O(l^3 + nl^2)$:
$O(l^3)$ for SVD on $W$ and $O(nl^2)$ for multiplication with $C$. The
Nystrom method is often presented with an additional step whereby $W$ in
(4) is replaced by its rank-$k$ approximation, $W_k$, for some $k \le l$,
thus generating $\widetilde{G}_k$, the rank-$k$ Nystrom approximation to
$G$. In this case, the runtime of the algorithm is reduced to
$O(l^3 + nlk)$.
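As a minimal sketch of (2)-(4), assuming NumPy and a dense matrix held in memory, the approximation can be computed as below. The function name `nystrom`, the tolerance used for the pseudo-inverse, the synthetic Gram matrix, and the choice to sample indices without replacement (which avoids duplicate columns) are assumptions made for this illustration and are not taken from the experiments.

```python
import numpy as np

def nystrom(G, idx, k=None):
    """Nystrom approximation C W^+ C^T of an SPSD matrix G from columns idx.

    If k is given, W is first replaced by its best rank-k approximation.
    """
    C = G[:, idx]                      # n x l matrix of sampled columns
    W = C[idx, :]                      # l x l block shared by rows and columns
    # W is symmetric PSD, so its eigendecomposition doubles as its SVD.
    evals, evecs = np.linalg.eigh(W)
    order = np.argsort(evals)[::-1]    # decreasing order
    evals, evecs = evals[order], evecs[:, order]
    if k is not None:
        evals, evecs = evals[:k], evecs[:, :k]
    # Pseudo-inverse: invert only eigenvalues above a relative tolerance.
    tol = evals.max() * len(idx) * np.finfo(float).eps
    keep = evals > tol
    W_pinv = (evecs[:, keep] / evals[keep]) @ evecs[:, keep].T
    return C @ W_pinv @ C.T

rng = np.random.default_rng(0)
n, N, l = 200, 50, 20
X = rng.standard_normal((N, n))
G = X.T @ X                                   # Gram matrix of rank at most N
idx = rng.choice(n, size=l, replace=False)    # uniform sample of column indices
G_tilde = nystrom(G, idx)
# Relative Frobenius distance between G and its approximation.
rel_err = np.linalg.norm(G - G_tilde, "fro") / np.linalg.norm(G, "fro")
```

The eigendecomposition of the small $l \times l$ block is the $O(l^3)$ step, and the two products with $C$ account for the $O(nl^2)$ term.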



2.2 Coherence

Although the Nystrom method tends to work well in practice, the
performance of this algorithm depends on the structure of the underlying
matrix. We will show that the performance is related to the size of the
entries of the singular vectors of $G$, or the coherence of its singular
vectors. We define $V_r$ as the top $r$ singular vectors of $G$, and
denote the coherence of these singular vectors as $\mu(V_r)$, which is
adapted from Candes and Romberg (2007).

Definition 1 (Coherence). The coherence of a matrix $V_r$ with orthonormal
columns is defined as:

$$\mu(V_r) = \sqrt{n} \max_{i,j} |V_r(i,j)|. \tag{5}$$

The coherence of $V_r$ is lower bounded by 1, as is the case for the
rank-1 matrix with all entries equal to $1/\sqrt{n}$, and upper bounded by
$\sqrt{n}$, as is the case for the matrix of canonical basis vectors. As
discussed in Candes and Recht (2009) and Candes and Tao (2009), highly
coherent matrices are difficult to randomly recover via matrix completion
algorithms, and this same logic extends to the Nystrom method. In
contrast, incoherent matrices are much easier to successfully complete and
to approximate via the Nystrom method, as discussed in Section 3.

In order to provide some intuition, Candes and Recht (2009) give several
classes of randomly generated matrices with low coherence. One such class
of matrices is generated from uniform random orthonormal singular vectors
and arbitrary singular values. For such a class they show that
$\mu = O(\sqrt{\log n} \cdot \sqrt[4]{r})$ with high probability.¹ In what
follows, we will show bounds on the number of points needed for
reconstruction that become more favorable as coherence decreases. However,
the bounds are useful for more generous values of coherence than given in
the above example. We will also provide an empirical study of coherence
for various real-world and synthetic examples.

¹ For low-rank matrices, $\sqrt[4]{r}$ is quite small. Moreover, this
$\sqrt[4]{r}$ factor only appears due to our use of the generally loose
inequality $\mu^2 \le \sqrt{r}\,\mu_1$, where $\mu_1$ is a slightly
different notion of coherence used in the original bound in Candes and
Recht (2009) for this class of matrices.

3 Low-rank, low-coherence bounds

In this section, we make use of coherence to analyze the Nystrom method
when used with low-rank matrices. We note that although the bounds
presented throughout this section hold for matrices of any rank $r$, they
are only interesting when $r = o(\sqrt{n})$, and hence they are most
applicable in the "low-rank" setting.

3.1 Nystrom method bound



The Nystrom method is empirically effective in cases where $G$ has
low-rank structure even if the matrix has full rank, i.e.,
$G \approx G_k$ for some $k \ll n$. Furthermore, as stated in Theorem 1
below, when $G$ is actually a low-rank matrix, then the Nystrom method can
exactly recover the initial matrix (we include the short proof for the
sake of completeness).

Theorem 1 (Kumar et al. (2009b), Thm. 3). Suppose
$r = \mathrm{rank}(G) \le k \le l$ and $\mathrm{rank}(W) = r$. Then the
Nystrom approximation is exact, i.e., $\|G - \widetilde{G}_k\|_F = 0$.

Proof. Since $G = X^\top X$, $\mathrm{rank}(G) = \mathrm{rank}(X) = r$.
Similarly, $W = X_1^\top X_1$ implies $\mathrm{rank}(X_1) = r$, i.e., the
columns of $X_1$ span the columns of $X$. We next let $U_{X_1,k}$ be the
$k$ left singular vectors of $X_1$ associated with the top $k$ singular
values of $X_1$. We then represent $W$ and $C$ in terms of $X_1$ and
$X_2$, to rewrite the Nystrom approximation as:

$$\widetilde{G}_k = C W_k^{+} C^\top = \begin{bmatrix} X_1^\top \\ X_2^\top \end{bmatrix} X_1 \left(X_1^\top X_1\right)_k^{+} X_1^\top \begin{bmatrix} X_1 & X_2 \end{bmatrix} = X^\top U_{X_1,k} U_{X_1,k}^\top X. \tag{6}$$

Furthermore, since the columns of $X_1$ span the columns of $X$,
$U_{X_1,r}$ is an orthonormal basis for the column space of $X$ and
$I - U_{X_1,r} U_{X_1,r}^\top$ is an orthogonal projection matrix onto the
nullspace of $X^\top$, i.e., the orthogonal complement of the column space
of $X$. Since $k \ge r$, from (6) we have

$$\|G - \widetilde{G}_k\|_F = \|X^\top (I - U_{X_1,k} U_{X_1,k}^\top) X\|_F = 0. \tag{7}$$

□
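The statement of Theorem 1 can be checked numerically. The following is a sketch under stated assumptions: NumPy, a synthetic rank-$r$ matrix built from Gaussian factors, illustrative sizes, and explicit tolerances for the rank test and the pseudo-inverse, none of which come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, r, l = 300, 40, 10, 30          # illustrative sizes (assumed)

# Rank-r data matrix X (N x n) and its Gram matrix G = X^T X.
X = rng.standard_normal((N, r)) @ rng.standard_normal((r, n))
G = X.T @ X

idx = rng.choice(n, size=l, replace=False)   # sampled column indices
X1 = X[:, idx]
C, W = G[:, idx], G[np.ix_(idx, idx)]

# W has rank r < l, so use explicit tolerances instead of the defaults.
rank_W = np.linalg.matrix_rank(W, tol=1e-8 * np.linalg.norm(W, 2))
G_tilde = C @ np.linalg.pinv(W, rcond=1e-10, hermitian=True) @ C.T
rel_err = np.linalg.norm(G - G_tilde, "fro") / np.linalg.norm(G, "fro")

# Equation (7): X^T (I - U U^T) X with U the top-r left singular vectors of X1.
U, _, _ = np.linalg.svd(X1, full_matrices=False)
P = U[:, :r] @ U[:, :r].T
proj_residual = np.linalg.norm(X.T @ (np.eye(N) - P) @ X, "fro")
```

When the sampled block has rank $r$, both `rel_err` and `proj_residual` should vanish up to floating-point error.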



This theorem implies that if $G$ has low rank, then there exists a
particular sampling such that $\mathrm{rank}(W) = \mathrm{rank}(G)$ and
the Nystrom method can perfectly recover the full matrix. However,
selecting a suitable set of $l$ columns from an $n \times n$ SPSD matrix
can be an intractable combinatorial problem, and there exist matrices for
which the probability of selecting such a subset uniformly at random is
exponentially small, e.g., the rank-$r$ SPSD diagonal matrices discussed
earlier. In contrast, a large class of SPSD matrices are much more
incoherent, and for these matrices, we will next show that by choosing $l$
to be linear in $r$ and logarithmic in $n$ we can with very high
probability guarantee that $\mathrm{rank}(W) = r$, and hence exactly
recover the initial matrix.

Probability of choosing a good subset 

We start with a rank-$r$ Gram matrix, $G$, and a fixed distribution,
$\mathcal{D}$, over the columns of $G$. Our goal is to calculate the
probability of randomly choosing a subset of $l$ columns of $G$ according
to $\mathcal{D}$ such that $\mathrm{rank}(W) = r$. Recall that
$G = X^\top X$, $X = [X_1 \; X_2]$ and $W = X_1^\top X_1$. Then, by
properties of SVD, we know that $\mathrm{rank}(G) = \mathrm{rank}(X)$ and
$\mathrm{rank}(W) = \mathrm{rank}(X_1)$. Hence, the probability of this
desired event is equivalent to the probability of sampling $l$ columns of
$X$ according to $\mathcal{D}$ such that $\mathrm{rank}(X_1) = r$, as
shown in (10). Next, we can write the thin SVD of $X$ as
$X = U_r \Sigma_r V_r^\top$, where $U_r \in \mathbb{R}^{N \times r}$,
$\Sigma_r \in \mathbb{R}^{r \times r}$ and
$V_r \in \mathbb{R}^{n \times r}$. Since $U_r$ contains orthonormal
columns and $\Sigma_r$ is invertible, we know that

$$\Sigma_r^{-1} U_r^\top X = V_r^\top. \tag{8}$$

Further, using the block representation of $X$, we have

$$X_1^\top U_r \Sigma_r^{-1} = V_{r,l}, \tag{9}$$

where $V_{r,l} \in \mathbb{R}^{l \times r}$ corresponds to the first $l$
components of each of the $r$ right singular vectors of $X$. Since
$\mathrm{rank}(X_1) = \mathrm{rank}(X_1^\top U_r \Sigma_r^{-1})$, we
obtain the equality of (11).

$$\Pr[\mathrm{rank}(W) = r] = \Pr[\mathrm{rank}(X_1) = r] \tag{10}$$
$$\phantom{\Pr[\mathrm{rank}(W) = r]} = \Pr[\mathrm{rank}(V_{r,l}) = r] \tag{11}$$

In the next section we calculate this probability for a specific
distribution in terms of $l$ as well as a measure of the coherence of
$V_r$.

Sampling Bound 

Given the orthonormal matrix $V_r$, we would like to find a choice of $l$
such that $V_{r,l}$ created by uniform sampling has rank $r$ with high
probability. As pointed out in the previous section, a meaningful bound
may not be possible for any $l < n$ if no assumption is made on $V_r$.
Here we adopt the assumption that $V_r$ has low coherence, as defined in
Definition 1. We then observe that by properties of SVD we have

$$\Pr\big(\mathrm{rank}(V_{r,l}) = r\big) = \Pr\big(\mathrm{rank}(V_{r,l}^\top V_{r,l}) = r\big). \tag{12}$$

Next, we define $\sigma = \|V_{r,l}^\top V_{r,l}\|_2$ and note that for
$0 < c \le 1/\sigma$, $c V_{r,l}^\top V_{r,l}$ is an $r \times r$ SPSD
matrix with singular values no greater than one. Furthermore,
$I - c V_{r,l}^\top V_{r,l}$ is also SPSD, so that

$$\Pr\big(\mathrm{rank}(V_{r,l}^\top V_{r,l}) = r\big) = \Pr\big(\|c V_{r,l}^\top V_{r,l} - I\| < 1\big), \tag{13}$$

since $\|c V_{r,l}^\top V_{r,l} - I\| = 1$ implies that the nullspace of
$c V_{r,l}^\top V_{r,l}$ is nontrivial. Alternatively, if $c > 1/\sigma$,
then

$$\Pr\big(\mathrm{rank}(V_{r,l}^\top V_{r,l}) = r\big) \ge \Pr\big(\|c V_{r,l}^\top V_{r,l} - I\| < 1\big), \tag{14}$$

since, for large enough $c$, we could have
$\|c V_{r,l}^\top V_{r,l} - I\| \ge 1$ even if
$\mathrm{rank}(V_{r,l}^\top V_{r,l}) = r$. Thus the inequality in (14)
holds for any constant $c > 0$, i.e., the probability on the RHS of (14)
serves as a lower bound for the probability of interest to us.
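To make the role of coherence in this probability concrete, the following sketch estimates $\Pr(\mathrm{rank}(V_{r,l}) = r)$ by simulation for two extreme choices of $V_r$: a random orthonormal basis (incoherent) and the first $r$ canonical basis vectors (maximally coherent, as in (1)). It assumes NumPy; the helper `rank_success_rate`, the sizes, the number of trials, and the eigenvalue threshold are illustrative choices rather than part of the analysis.

```python
import numpy as np

def rank_success_rate(V_r, l, trials, rng):
    """Fraction of uniform row samples for which V_{r,l} has full column rank."""
    n, r = V_r.shape
    hits = 0
    for _ in range(trials):
        rows = rng.choice(n, size=l, replace=False)
        V_rl = V_r[rows, :]
        M = V_rl.T @ V_rl                    # r x r SPSD matrix, as in (12)
        # Full rank iff the smallest eigenvalue is bounded away from zero.
        if np.linalg.eigvalsh(M)[0] > 1e-10:
            hits += 1
    return hits / trials

rng = np.random.default_rng(2)
n, r, l = 500, 10, 40
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))   # incoherent basis
E = np.eye(n)[:, :r]                               # coherent basis e_1, ..., e_r
p_incoherent = rank_success_rate(Q, l, 200, rng)
p_coherent = rank_success_rate(E, l, 200, rng)
```

For the coherent basis, success requires all of the first $r$ rows to be among the $l$ sampled rows, which is very unlikely when $l \ll n$.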



The probability on the RHS of (14) has been studied in previous
compressive sampling literature. Specifically, Candes and Romberg (2007)
make use of a main lemma of Rudelson (1999) to derive Theorem 2, which
provides us with our desired lower bound.

Theorem 2 (Candes and Romberg (2007), Thm. 1.2). Define
$V_r \in \mathbb{R}^{n \times r}$ such that $V_r^\top V_r = I$ and let
$V_{r,l} \in \mathbb{R}^{l \times r}$ be generated from $V_r$ by sampling
rows uniformly at random. Then, the following holds with probability at
least $1 - \delta$,

$$\left\| \frac{n}{l} V_{r,l}^\top V_{r,l} - I \right\| < \frac{1}{2} \tag{15}$$

for any $l$ that satisfies,

$$l \ge r \mu^2(V_r) \max\big(C_1 \log(r),\, C_2 \log(3/\delta)\big), \tag{16}$$

where $C_1$ and $C_2$ are positive constants.

Note that our definition of coherence and statement of Theorem 2 are
modified to account for the fact that $V_r^\top V_r = I$ as opposed to
$nI$, as in Candes and Romberg (2007). Also, $V_r$ is not square as
assumed in the original theorem; however, it can be verified that the
proof holds even for this case.

By making use of Theorem 2 we can now answer the question regarding the
number of columns needed to sample from $G$ in order to obtain an exact
reconstruction via the Nystrom method. Theorem 3 presents a bound on $l$
for matrix completion in terms of $\mu$.

Theorem 3. Let $G \in \mathbb{R}^{n \times n}$ be a rank-$r$ SPSD matrix
and assume $r \in O(1/\delta)$, then it suffices to sample
$l \ge O(r \mu^2(V_r) \log(\delta^{-1}))$ columns to have with probability
at least $1 - \delta$,

$$\|G - \widetilde{G}_k\| = 0. \tag{17}$$

Proof. Theorem 1 states sufficient conditions for exact matrix completion.
Equations (10) and (11) reduce these sufficient conditions to a condition
on the rank of $V_{r,l}$. Equations (12) and (14) further reduce this
problem to a similar problem previously studied in the context of
compressed sensing. Finally, we use Theorem 2 to bound with high
probability the RHS of (14). □

Dataset                              Type of data     # Points (n)  # Features (d)  Kernel
PIE (Sim et al., 2002)               face images      2731          2304            linear
MNIST (LeCun and Cortes, 1998)       digit images     4000          784             linear
Essential (Gustafson et al., 2006)   proteins         4728          16              RBF
Abalone (Asuncion and Newman, 2007)  abalones         4177          8               RBF
Dexter (Asuncion and Newman, 2007)   bag of words     2000          20000           linear
Artificial                           random features  1000          20000           linear

Table 1: A summary of the datasets used in the experiments, including the
type of data, the number of points (n), the number of features (d) and the
choice of kernel.

[Figure 1 plots: two panels, each titled "Reconstruction Error", with
x-axis "Number of Sampled Columns (l)".]

Figure 1: Mean percent error over 10 trials of Nystrom approximations of
rank 100 matrices. Left: Results for $l$ ranging from 5 to 200. Right:
Detailed view of experimental results for $l$ ranging from 50 to 130.

The assumption of $r \in O(1/\delta)$ is used only to simplify
presentation and avoid the appearance of a max term, as in (16).
Furthermore, although this assumption implies that the input matrix has
low rank, the low rank setting is precisely the setting we are interested
in.

4 Experiments 

In this section we present a series of empirical results that show the
connection between matrix coherence and the performance of the Nystrom
method. We first perform two sets of experiments that corroborate the
theoretical claims made in the previous section: Section 4.1 illustrates
the performance of the Nystrom method for low-rank matrices using the six
datasets detailed in Table 1, while Section 4.2 interprets these results
in the context of the coherence of these datasets. Next, we present more
general experimental results in Section 4.3 that connect matrix coherence
to the Nystrom method in the case of full rank matrices.

4.1 Reconstruction error

In our first set of experiments we measure the accuracy of the Nystrom
approximation ($\widetilde{G}_k$) for a variety of rank-$r$ matrices, with
$r = 100$. For each of the six datasets listed above, we first constructed
the optimal rank-$r$ approximation to each kernel matrix by reconstructing
with the top $r$ eigenvalues and eigenvectors. Next, we performed the
Nystrom method for various values of $l$ to generate a series of
approximations to our rank-$r$ matrix (note that we set $k = l$). For each
approximation, we calculated the percent error of the Nystrom
approximation, defined as follows:

$$\text{Percent error} = \frac{\|G - \widetilde{G}_k\|_F}{\|G\|_F} \times 100. \tag{18}$$

The results of this experiment, averaged over 10 trials, are presented in
Figure 1. The figure shows that for five of the six datasets, the Nystrom
method exactly reconstructs the initial rank-$r$ matrix when the number of
sampled columns ($l$) is equal to or slightly larger than $r$. Note that
this observation holds for each of the ten trials, since the mean error is
zero for each of these datasets when $l \approx r$. In contrast, for the
case of the Abalone dataset, we do not see convergence to zero percent
error as $l$ surpasses $r$, and the percent error is non-zero even when
$l = 2r$.

Figure 2: Coherence of Datasets. Left: Coherence of rank 100 SPSD matrices
used in experiments in Section 4.1. Right: Asymptotic growth of coherence
for MNIST and Abalone datasets. Note that coherence values are means over
ten trials.

4.2 Coherence of datasets 

In this set of experiments, we use the concept of coherence to explain the
results from Section 4.1, namely that the Nystrom method generates an
exact matrix reconstruction for $l \approx r$ for five of the six
datasets, but fails to do so for the Abalone dataset. As such, we first
calculated the coherence of each of the six SPSD rank 100 matrices used in
Section 4.1, using the definition of coherence from Definition 1. The left
panel of Figure 2 shows the coherence of these matrices with respect to
the number of points in the dataset. This plot illustrates the stark
contrast between Abalone and the other five datasets in terms of
coherence, and helps validate our theoretical connection between
low-coherence matrices and the ability to generate exact reconstructions
via the Nystrom method.
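Computing coherence according to Definition 1 only requires the top $r$ singular vectors of the matrix. The sketch below assumes NumPy and a dense SPSD input; the function name `coherence` and the two toy matrices, chosen to hit the lower and upper bounds of the definition, are illustrative and are not the datasets of Table 1.

```python
import numpy as np

def coherence(G, r):
    """mu(V_r) = sqrt(n) * max_ij |V_r(i, j)| for the top-r singular vectors of G."""
    n = G.shape[0]
    # For an SPSD matrix the eigenvectors are the singular vectors.
    evals, evecs = np.linalg.eigh(G)            # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:r]           # indices of the r largest
    V_r = evecs[:, top]
    return np.sqrt(n) * np.abs(V_r).max()

n = 64
# Lower bound: rank-1 matrix whose singular vector has all entries 1/sqrt(n).
v = np.ones((n, 1)) / np.sqrt(n)
mu_flat = coherence(v @ v.T, 1)
# Upper bound: singular vectors are canonical basis vectors e_1, e_2, e_3.
E = np.eye(n)[:, :3]
mu_spiky = coherence(E @ np.diag([3.0, 2.0, 1.0]) @ E.T, 3)
```

Distinct singular values are used in the second example so that the singular vectors are uniquely determined up to sign.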

Next, we performed an experiment in which we repeatedly subsampled the
initial SPSD matrices to generate matrices with different dimensions,
i.e., different values of $n$. For each value of $n$, we computed the
coherence of the subsampled matrix, again using Definition 1. The right
panel of Figure 2 shows the mean results over ten trials for both the
MNIST and Abalone datasets. As illustrated by this plot, the coherence of
the Abalone dataset grows much more quickly than that of the MNIST
dataset. As illustrated by the orthogonal random model, we expect
incoherent matrices to exhibit a slow rate of growth, i.e.,
$O(\sqrt{\log n})$. The plots for the other four datasets (not shown) are
comparable to the MNIST dataset. These results provide further intuition
for why the Nystrom method is able to perform exact reconstruction on all
datasets except for Abalone.

4.3 Full rank experiments 

As discussed in Section 1, the Nystrom method hinges on two assumptions:
good low-rank structure of the matrix and the ability to extract
information from a small subset of $l$ columns of the input matrix. In
this section, we analyze the effect of each of these assumptions on
Nystrom method performance on full-rank matrices, using matrix coherence
as a quantification of the latter assumption. To do so, we devised a
series of experiments using synthetic datasets that precisely control the
effects of each of these parameters.

To control the low-rank structure of the matrix, we generated artificial
datasets with exponentially decaying eigenvalues with differing decay
rates, i.e., for $i \in \{1, \ldots, n\}$ we defined the $i$th singular
value as $\sigma_i = \exp(-i\eta)$, where $\eta$ controls the rate of
decay. For a fixed value of $\eta$, we then measured the percentage of the
spectrum captured by the top $k$ singular values as follows:

$$\text{Percent of Spectrum} = \frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{n} \sigma_i} \times 100. \tag{19}$$
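The quantity in (19) is straightforward to evaluate for the exponentially decaying spectrum described above. The sketch below assumes NumPy; the function name `percent_of_spectrum` and the listed decay rates are hypothetical and are not the values used in the experiments, while $n = 2000$ and $k = 50$ match the setting reported for Figure 3.

```python
import numpy as np

def percent_of_spectrum(eta, n, k):
    """Percentage of the spectrum held by the top k of n values exp(-i * eta)."""
    sigma = np.exp(-eta * np.arange(1, n + 1))   # already in decreasing order
    return 100.0 * sigma[:k].sum() / sigma.sum()

n, k = 2000, 50
etas = [0.001, 0.01, 0.1]      # hypothetical decay rates, for illustration only
percents = [percent_of_spectrum(eta, n, k) for eta in etas]
```

Because the spectrum is geometric with ratio $q = e^{-\eta}$, the same value is given in closed form by $100\,(1 - q^k)/(1 - q^n)$, so a faster decay (larger $\eta$) yields a larger captured percentage.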

Figure 3: Coherence experiments with full rank synthetic datasets, with
n = 2000 and k = 50. Each plot corresponds to matrices with a fixed
eigenvalue decay rate (resulting in a fixed percentage of spectrum
captured) and each line within a plot corresponds to the average results
of 10 randomly generated matrices with the specified coherence.
Furthermore, results for each such matrix for a fixed percentage of
sampled columns are the means over 5 random subsets of columns.

To control coherence, we generated singular vectors with varying
coherences by forcing the first singular vector to achieve our desired
coherence and then using QR to generate a full orthogonal basis. The
smallest values of $\mu$ used in our experiments correspond to randomly
generated orthogonal matrices. We report the results of our experiments in
Figure 3. For these experiments we set $n = 2000$ and $k = 50$. Each plot
corresponds to matrices with a fixed eigenvalue decay rate (resulting in a
fixed percentage of spectrum captured) and each line within a plot
corresponds to the average results of 10 randomly generated matrices with
the specified coherence. Furthermore, results for each such matrix for a
fixed percentage of sampled columns are the means over 5 random subsets of
columns.
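One way such a basis could be generated is sketched below. This is an assumption about the construction, since the text only states that the first singular vector is forced to the desired coherence and that QR completes the basis: here the first vector gets a single entry of magnitude $\mu/\sqrt{n}$ with the remaining mass spread evenly, and the other columns come from the QR factorization of a Gaussian matrix. The function name `basis_with_coherence`, the sizes, and the decay rate are hypothetical.

```python
import numpy as np

def basis_with_coherence(n, mu, rng):
    """Orthogonal n x n matrix whose first column has max |entry| = mu / sqrt(n).

    Requires 1 <= mu <= sqrt(n). Only the first column is controlled; the
    remaining columns are a random orthonormal completion.
    """
    peak = mu / np.sqrt(n)
    v = np.empty(n)
    v[0] = peak
    # Spread the remaining mass evenly so that v has unit norm.
    v[1:] = np.sqrt((1.0 - peak**2) / (n - 1))
    A = rng.standard_normal((n, n))
    A[:, 0] = v
    Q, R = np.linalg.qr(A)
    # QR determines columns only up to sign; restore the first column to v.
    Q[:, 0] *= np.sign(R[0, 0])
    return Q

rng = np.random.default_rng(3)
n, mu, eta = 200, 5.0, 0.05        # small illustrative values (assumed)
Q = basis_with_coherence(n, mu, rng)
sigma = np.exp(-eta * np.arange(1, n + 1))
G = (Q * sigma) @ Q.T              # full-rank SPSD matrix Q diag(sigma) Q^T
first_vector_coherence = np.sqrt(n) * np.abs(Q[:, 0]).max()
```

Note that the coherence of the top $k$ singular vectors of the resulting matrix is at least the target value, since the remaining columns are not constrained in this sketch.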

There are two main observations to be drawn from our 
experiments. First, as noted in previous work with the 
Nystrom method, the Nystrom method generates bet- 
ter approximations for matrices with better low rank 



structure, i.e., matrices with a higher percentage of 
spectrum captured by the top k singular values. Sec- 
ond, following the same pattern as in the low-rank 
setting, the Nystrom method generates better approx- 
imations for lower coherence matrices, and hence, ma- 
trix coherence appears to effectively capture the degree 
to which information can be extracted from a subset 
of columns. 

5 Conclusion and future work 

In this work, we make a connection between matrix co- 
herence and the performance of the Nystrom method. 
Making use of related work in the compressed sensing 
and the matrix completion literature, we derive novel 
coherence-based bounds for the Nystrom method in 
the low-rank setting. We then present empirical re- 
sults that corroborate these theoretical bounds. Fi- 



nally, we present more general empirical results for 
the full-rank setting that convincingly demonstrate 
the ability of matrix coherence to measure the de- 
gree to which information can be extracted from a 
subset of columns. Future work involves developing 
algorithms to efficiently estimate the coherence of a 
dataset to help quickly determine the applicability of 
the Nystrom method on a case-by-case basis as well as 
generalizing our coherence-based bounds to the case of 
full rank matrices. 